Documentation for "Gone with the Wind: Federalism and the Strategic Location of Air Polluters" by James E. Monogan III, David M. Konisky, and Neal D. Woods

This file first describes the contents of every data file necessary to replicate our project. Second, it describes every software program necessary for replication. Third, it describes additional documentation on record.

Sufficient information is present to start from the raw data, reproduce our model-formatted data, and reproduce the models we estimated. Those who simply want to start with the formatted data and reproduce our models need only four files from our Dataverse: the comma-separated data set majorAirAllDist.csv, the comma-separated data set anyPowerPlantDist.csv, the R program lower48control.R, and the R program powerPlantsModel.R.
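
As a minimal sketch of a quick-start check, the R snippet below reports which of the four files are absent from a directory. The helper name missing_quick_start_files and its dir argument are hypothetical and are not part of the replication code.

```r
# The four files named above for replicating the models from formatted data.
quick_start_files <- c("majorAirAllDist.csv", "anyPowerPlantDist.csv",
                       "lower48control.R", "powerPlantsModel.R")

# Hypothetical helper: return the names of any quick-start files
# that are not found in the directory 'dir'.
missing_quick_start_files <- function(dir = ".") {
  present <- file.exists(file.path(dir, quick_start_files))
  quick_start_files[!present]
}

# Example: check the current working directory.
missing_quick_start_files(".")
```

If the function returns character(0), all four files are in place. Presumably the two R programs are then run from that same working directory, although this is an assumption rather than something stated here.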


DATA FILES
* Air_Majors.csv: EPA geospatial data on major air polluters, including the name, address, and latitude and longitude coordinates of each polluter. Accessed April 17, 2012 from the Geospatial Data Download Service (http://www.epa.gov/enviro/geo_data.html).

* anyPowerPlantDirection.csv: Output file from powerPlantsManipulation.R that lists all power plants and large quantity generators of hazardous waste, along with their geospatial location and our forecast of the prevailing wind direction in radians. The forecast is labeled "angle.full", and it consists of two components--the prediction from the general model ("angle.mod") and the prediction from the locally weighted residuals from the training data ("angle.error").
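
The relationship among the three angle columns can be sketched in R as follows. This sketch assumes that the two components combine additively on the circle and that angles are wrapped to the interval [0, 2*pi). Neither convention is confirmed by this document, and the function name combine_angle is hypothetical.

```r
# Hypothetical helper: add the model prediction and the residual
# prediction, then wrap the sum onto [0, 2*pi).
# ASSUMPTION: angle.full = angle.mod + angle.error (mod 2*pi).
combine_angle <- function(angle.mod, angle.error) {
  (angle.mod + angle.error) %% (2 * pi)
}

# A model angle of pi plus a residual of pi/2 gives 3*pi/2.
combine_angle(pi, pi / 2)

# To check the assumption against the file (left commented out so the
# sketch runs without the data present):
# dirs  <- read.csv("anyPowerPlantDirection.csv")
# recon <- combine_angle(dirs$angle.mod, dirs$angle.error)
# summary(sin((dirs$angle.full - recon) / 2))  # near 0 if the assumption holds
```

The same check would apply to majorAirDirection.csv, which uses the same three column labels.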

* anyPowerPlantDist.csv: The file used to fit the models before and after the Clean Air Act. It is the main output of the program powerPlantsManipulation.R. A full description of the variables is listed at the end of this document.

* countyDensity.csv: Population density by county from 2010 U.S. Census. Counties are referenced by four-digit FIPS code. (Downloaded from American FactFinder on June 28, 2015.)

* egrid_allplants.dta: Data for power plants in the United States, including name of the facility, location in latitude and longitude, and the year the first generator went online.

* majorAirAllDist.csv: The primary file used to fit the models. It is the main output of the program dataManipulation.R. A full description of the variables is listed at the end of this document.

* majorAirDirection.csv: Output file from dataManipulation.R that lists all major air polluters and large quantity generators of hazardous waste, along with their geospatial location and our forecast of the prevailing wind direction in radians. The forecast is labeled "angle.full", and it consists of two components--the prediction from the general model ("angle.mod") and the prediction from the locally weighted residuals from the training data ("angle.error").

* new_public.dta: Simple data file containing the registry ID of the polluters, an indicator for whether the facility was constructed after 2005 ("new"), and an indicator for whether the facility is operated by any level of government ("public").

* sicMerger.csv: Simple data file containing the registry ID of the polluters we study as well as their standard industrial classification (SIC) codes. This allows us to estimate models with industry-referenced subsets.

* statePolluterCovs.csv: Data for the 50 American states including two green index measures (gi_tot and gi_net), a measure of locational economic development (ed_locational), and two measures of environmental interest groups (env_ig and scaledIG). This allows us to test whether features of the state condition the level of free riding.

* tl_2011_us_state.zip: ZIP file containing Census TIGER/Line Shapefiles of the American states. The program dataManipulation.R requires the five files in this archive to be in the working directory itself, not in a subdirectory.
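
One way to extract the archive contents flat into the working directory is sketched below, using the unzip function from R's utils package. It assumes the ZIP file has already been downloaded into the working directory.

```r
# Extract every file in the archive directly into the working directory.
# junkpaths = TRUE discards any folder structure stored inside the ZIP,
# so the files do not end up in a subdirectory.
unzip("tl_2011_us_state.zip", junkpaths = TRUE, exdir = ".")

# List the files the archive contains without extracting them again.
unzip("tl_2011_us_state.zip", list = TRUE)$Name
```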

* tri_emissions2010.dta: Data from the EPA's 2010 Toxics Release Inventory (TRI). The data set includes air polluters' names and addresses, location in latitude and longitude, and total toxic emissions. Accessed February 7, 2013 from the TRI program (http://www2.epa.gov/toxics-release-inventory-tri-program).

* tsdfs.dta: EPA geospatial data on large quantity generators (LQG) of hazardous waste and hazardous waste treatment, storage, and disposal facilities (TSDF). Data include whether the facility is an LQG or TSDF, the name and address of each facility, and the latitude and longitude coordinates of the facility. Accessed July 5, 2012 from the Geospatial Data Download Service (http://www.epa.gov/enviro/geo_data.html).

* wind1996.pdf: The original PDF document from NOAA entitled, "Climatic Wind Data for the United States." This document contains the prevailing wind direction data from 1930-1996 that we use in the wind kriging model. The document is dated November 1998.

* windAngle.txt: Wind direction data from NOAA at 299 weather stations across the continental United States. These data include the location of each station, the latitude and longitude of the station's location, and the prevailing wind direction averaged from 1930-1996. Prevailing wind direction is reported in compass direction, degrees, and radians.
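
The three representations of direction are related by simple conversions, sketched here in R. The sketch assumes a 16-point compass with north at 0 degrees and angles increasing clockwise, and a plain degrees-to-radians rescaling. The encoding actually used in windAngle.txt may differ (for example, in its zero point or orientation), and the function names are hypothetical.

```r
# ASSUMPTION: 16-point compass, N = 0 degrees, increasing clockwise.
compass_points <- c("N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
                    "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW")

# Hypothetical helper: compass label to degrees (NA if unrecognized).
compass_to_degrees <- function(x) {
  idx <- match(toupper(x), compass_points)
  (idx - 1) * 22.5
}

# Hypothetical helper: degrees to radians.
degrees_to_radians <- function(deg) {
  deg * pi / 180
}

# Under these assumptions, southwest is 225 degrees, or 1.25 * pi radians.
degrees_to_radians(compass_to_degrees("SW"))
```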


SOFTWARE CODE
* altKrigCRF.R: A program read in by dataManipulation.R that revises code from the CircSpatial library to produce output that better links with the structure of these data.

* CircSpatialV3.tar.gz: Archived R library by Bill Morphet. This library must be installed in order to perform circular kriging in R, so it is a dependency for the program dataManipulation.R.
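
A source archive of this kind can typically be installed from within R as sketched below. This assumes the archive sits in the working directory and that a toolchain for building source packages is available. Because repos = NULL disables automatic dependency resolution, any packages the library depends on would need to be installed beforehand.

```r
# Install the archived library from the local source tarball.
# repos = NULL tells R to treat the first argument as a file path.
install.packages("CircSpatialV3.tar.gz", repos = NULL, type = "source")

# Confirm that the library now loads.
library(CircSpatial)
```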

* dataManipulation.R: This program works in several stages to measure predictors of interest and format the data for analysis. First, it merges air and hazardous waste pollution information, along with data on pollution emissions. Second, it cleans miscoded data. Third, it fits a kriging model of wind direction over 299 observed sites. Fourth, it uses this model to predict the prevailing wind at each of the 36747 sites we study. Fifth, it measures distance from the site to the downwind, eastern, upwind, and western borders. Sixth, it merges standard industrial classification (SIC) codes into the data set. Seventh, it merges in additional state-level covariates. Eighth, it adds additional site-level covariates. In addition to regularly available R libraries, this program requires two inputs from our Dataverse: CircSpatialV3.tar.gz must be installed, and altKrigCRF.R must be present in the working directory. The data files that this program reads in are: tri_emissions2010.dta, Air_Majors.csv, tsdfs.dta, windAngle.txt, countyDensity.csv, the five files in tl_2011_us_state.zip, statePolluterCovs.csv, and new_public.dta. It outputs several graphs in PDF format, the file majorAirDirection.csv, and the file majorAirAllDist.csv (which is the primary file for analysis).

* lower48control.R: This program fits a variety of spatial point pattern models: the base models of relative risk of air polluters given distance to border, the models over subsets of industry types, the models that interact distance with state-level covariates, the models that subset by the level of facility toxicity, and a model with no control group focused on the nonhomogeneous Poisson process of air polluter location. This program also produces figures of the smoothed intercept term for the basic model of the effect of scaled distance to downwind border, predicted probabilities of major air polluters conditioning on features of the state, the geographic distribution of site locations in seven different states, and a forest plot of coefficients from various subsets by toxicity level. It reads in the file majorAirAllDist.csv.

* powerPlantsManipulation.R: This program manipulates the data for power plants in order to prepare them for the analysis that subsets based on siting before or after the enactment of the Clean Air Act of 1970. This file reads in egrid_allplants.dta, majorAirAllDist.csv, windAngle.txt, and the five files in tl_2011_us_state.zip. It outputs the files anyPowerPlantDirection.csv and anyPowerPlantDist.csv.

* powerPlantsModel.R: This program estimates three models of the location of power plants (but not other major air polluters) relative to large quantity generators. The first model includes all power plants that went online before 1970. The second model includes all power plants that went online from 1970 to 1979. The third model includes power plants that went online in 1980 or later. It reads in the file anyPowerPlantDist.csv.
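
The three-way split by online year can be sketched in R as follows. The function name online_era, the period labels, and the column name onlineYear in the commented usage line are all hypothetical; the actual variable names are given in the codebook.

```r
# Hypothetical helper: assign each online year to one of the three
# periods described above. With the default right-closed intervals,
# the breaks give (-Inf, 1969], (1969, 1979], and (1979, Inf).
online_era <- function(year) {
  cut(year,
      breaks = c(-Inf, 1969, 1979, Inf),
      labels = c("before 1970", "1970 to 1979", "1980 or later"))
}

# Boundary years fall into the expected periods.
online_era(c(1969, 1970, 1979, 1980))

# Usage on the data file (column name is an assumption):
# plants <- read.csv("anyPowerPlantDist.csv")
# table(online_era(plants$onlineYear))
```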

DOCUMENTATION
* CodebookWind.pdf: A codebook describing every variable in the replication files majorAirAllDist.csv and anyPowerPlantDist.csv.

* gwwAppendix.pdf: An online-only appendix offering additional details of the work reported in the printed article.